Abstract
Acute myeloid leukemia (AML) represents a heterogeneous group of hematologic malignancies requiring highly individualized treatment strategies that integrate molecular profiles, cytogenetic data, and patient-specific clinical factors. The emergence of artificial intelligence (AI) and large language models (LLMs) presents an opportunity to enhance clinical decision-making. Our quality improvement (QI) project evaluates the concordance of LLM-generated treatment recommendations and clinical outcome predictions with actual clinical decisions made by experienced hematology-oncology physicians. Additionally, we assess physician agreement with AI-suggested treatment plans to explore the potential utility of LLMs as clinical decision support tools in the management of AML patients.
We conducted a retrospective chart review of 20 patients diagnosed with and treated for AML at our tertiary care institution from July 2021 to March 2025. De-identified patient data, including age, sex, comorbidities, ECOG performance status, laboratory values, imaging results, bone marrow biopsy findings with immunophenotyping, cytogenetic analysis, and next-generation sequencing (NGS), were entered into four widely used LLMs: OpenEvidence (a medical-specific AI platform), ChatGPT-4 (OpenAI), Gemini 2.5 (Google), and Claude Opus (Anthropic). Each model was tasked with generating treatment recommendations, predicting induction hospital stay length, and estimating overall prognosis and in-hospital outcomes, including remission status, hospice transition, and treatment-related mortality. AI-generated results were compared with the actual treatments delivered and the observed clinical outcomes. Additionally, to assess physician perception of AI recommendations, we administered a structured survey to 10 hematology-oncology physicians (fellows and attendings), who evaluated the AI-generated treatment plans on a validated 5-point Likert scale (1 = strongly disagree to 5 = strongly agree). Physicians were blinded to both the specific LLM source and the actual treatment decisions to minimize bias.
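The abstract does not publish the exact prompt or data schema used to query the models, so the following is an illustrative sketch only: one plausible way a de-identified record could be flattened into a structured query for each LLM. Every field name and the prompt wording are hypothetical assumptions, not the study's actual inputs.

```python
# Illustrative sketch only: the record fields and prompt template below are
# hypothetical stand-ins, not the study's actual inputs.

DEIDENTIFIED_RECORD = {
    "age": 67,                          # years
    "sex": "F",
    "ecog": 1,                          # ECOG performance status
    "comorbidities": ["type 2 diabetes", "hypertension"],
    "labs": {"wbc": 34.2, "hgb": 8.1, "plt": 42},  # representative values
    "marrow": "70% blasts, myeloid immunophenotype",
    "cytogenetics": "normal karyotype",
    "ngs": ["NPM1 mutation", "FLT3-ITD negative"],
}

PROMPT_TEMPLATE = (
    "You are assisting with treatment planning for a patient with AML.\n"
    "Patient summary: {summary}\n"
    "Provide: (1) a recommended treatment plan, (2) expected induction "
    "hospital stay length, (3) overall prognosis, and (4) likely in-hospital "
    "outcomes (remission status, hospice transition, treatment-related mortality)."
)

def build_prompt(record: dict) -> str:
    """Flatten a de-identified record into a single prompt string."""
    summary = "; ".join(f"{k}={v}" for k, v in record.items())
    return PROMPT_TEMPLATE.format(summary=summary)

print(build_prompt(DEIDENTIFIED_RECORD))
```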
The treatment plans proposed by the evaluated LLMs demonstrated high concordance with actual clinical management, matching physician decisions in 16 of 20 patients (80%). Overall prognosis predictions showed comparable accuracy, aligning with the prognosis predicted by the treating physician in 16 of 20 cases (80%). However, the LLMs were substantially less accurate at predicting operational and short-term clinical metrics, correctly estimating length of hospital stay in only 9 of 20 cases (45%) and accurately predicting in-hospital outcomes in 11 of 20 cases (55%).
Physician evaluation of LLM performance yielded favorable ratings across all clinical prediction domains on the 5-point Likert scale. Overall mean scores across all four LLMs were: treatment recommendations, 4.02 (95% CI: 3.44-4.59); overall prognosis prediction, 4.01 (95% CI: 3.52-4.50); in-hospital outcomes, 3.98 (95% CI: 3.43-4.53); and length of stay prediction, 3.84 (95% CI: 3.22-4.45). Model-specific analysis revealed that OpenEvidence outperformed the other LLMs in three critical domains: prognosis prediction, with a mean score of 4.52 (95% CI: 4.01-5.00); in-hospital outcomes prediction, with a mean of 4.32 (95% CI: 3.83-4.81); and treatment recommendations, with a mean of 4.18 (95% CI: 3.51-4.85). Claude Opus demonstrated superior performance in length of stay prediction. Notably, mean physician agreement scores remained above 3.8 in every domain across all models, indicating consistently positive physician assessment of LLM capabilities in clinical prediction tasks. Most physicians (80%) reported being open to using AI in clinical decision-making after completing the survey.
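For readers who want to reproduce the style of summary statistic reported above, here is a minimal sketch of a mean Likert rating with a 95% confidence interval. The abstract does not state which CI method was used, so the t-based interval, the clipping of the upper bound at the 5.00 scale maximum (as in the prognosis CI above), and the example ratings are all assumptions made for illustration.

```python
# Minimal sketch of the summary statistics reported above: mean physician
# Likert rating with a t-based 95% confidence interval. The ratings list is
# fabricated for illustration; the CI method is an assumption, not the
# study's stated analysis.
from math import sqrt
from statistics import mean, stdev

from scipy import stats  # for the t critical value

def likert_summary(ratings: list[int], confidence: float = 0.95):
    n = len(ratings)
    m = mean(ratings)
    se = stdev(ratings) / sqrt(n)                      # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, n - 1)  # two-sided t quantile
    # Clip to the 1-5 Likert range, matching the reported upper bound of 5.00
    lo, hi = max(1.0, m - t_crit * se), min(5.0, m + t_crit * se)
    return m, (lo, hi)

# Hypothetical ratings from 10 physicians for one domain (1-5 scale)
example = [4, 5, 4, 3, 5, 4, 4, 3, 5, 4]
m, ci = likert_summary(example)
print(f"mean={m:.2f}, 95% CI: {ci[0]:.2f}-{ci[1]:.2f}")
```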
Our QI project demonstrates that LLMs show promise as clinical decision support tools for AML management, with high concordance between AI and physician treatment decisions and favorable physician ratings across all domains. However, the lower accuracy in predicting operational metrics (length of stay and in-hospital outcomes) identifies key areas for improvement. We believe this gap reflects non-clinical factors such as insurance approvals, disposition planning, nursing ratios, and consultant availability; LLMs are trained largely on idealized guidelines and literature and cannot yet integrate these complex real-world variables.